Lesion-conditioning of synthetic MRI-derived subtraction-MIPs of the breast using a latent diffusion model

The purpose of this feasibility study is to investigate whether latent diffusion models (LDMs) are capable of generating contrast-enhanced (CE) MRI-derived subtraction maximum intensity projections (MIPs) of the breast that are conditioned by lesions. We trained an LDM with n = 2832 CE-MIPs of breast MRI examinations of n = 1966 patients (median age: 50 years) acquired between 2015 and 2020. The LDM was subsequently conditioned with n = 756 segmented lesions from n = 407 examinations, indicating their location and BI-RADS scores. By applying the LDM, synthetic images were generated from the segmentations of an independent validation dataset. Lesions, anatomical correctness, and realistic impression of synthetic and real MIP images were further assessed in a multi-rater study with five independent raters, each evaluating n = 204 MIPs (50% real/50% synthetic images). The detection of synthetic MIPs by the raters was akin to random guessing, with an AUC of 0.58. Interrater reliability of the lesion assessment was high both for real (Kendall’s W = 0.77) and synthetic images (W = 0.85). A higher AUC was observed for the detection of suspicious lesions (BI-RADS ≥ 4) in synthetic MIPs (0.88 vs. 0.77; p = 0.051). Our results show that LDMs can generate lesion-conditioned MRI-derived CE subtraction MIPs of the breast; however, they also indicate that the LDM tended to generate rather typical or ‘textbook representations’ of lesions.

segmentation masks 11. Latent diffusion models (LDMs) can be trained efficiently in a lower-dimensional "latent space" instead of the high-dimensional pixel space 11. Recently, they have also been applied to medical datasets, such as chest X-rays and brain MRI [13][14][15][16], mostly with the goal of augmenting datasets for the training of NNs.
In the context of breast MRI, LDMs have been applied, for example, by Khader et al. to pre-train a NN with synthetic data to improve segmentation performance 15. The group of Graham et al. developed denoising diffusion probabilistic models (DDPMs) for out-of-distribution detection and evaluated them on a medical dataset that also included breast MRI data 17. Our feasibility study aims to investigate the capability of LDMs to generate synthetic MRI-derived CE-MIPs of the breast that are conditioned by lesions.

Study Sample Characteristics
A total of n = 1966 patients (median age at first examination [IQR]: 50 [42 to 59] years) with a total of n = 2832 breast MRI examinations were included in this analysis. Multiple examinations were performed in n = 495 patients (Table 1). The autoencoder NN was trained with all available examination MIPs, whereas the LDM was conditioned on a subset thereof for which segmentations were available. The training dataset for conditioning the LDM with the segmented lesions contained n = 407 examination MIPs of n = 338 patients (median age [IQR]: 50 [41.50 to 59] years) with a total of n = 756 lesions. The validation dataset consisted of n = 102 examination MIPs of n = 84 patients (n = 193 lesions). According to the histopathology, 160 out of 407 MIPs (39%) in the training dataset contained malignant lesions, whereas in the validation dataset, 37 out of 102 MIPs (36%) contained malignant lesions. Details on the conditioning subset are given in Table 2.

Diffusion model outputs
The model weights from the epoch with the lowest validation loss (epoch 376) were used to generate the synthetic MRI-derived CE-MIPs of the breast. The training and validation loss curves of the conditioned LDM are given in supplement S5. For demonstration purposes, n = 120 examples of the generated synthetic breast MRI MIPs as well as the segmentation masks used for conditioning the LDM and the corresponding acquired MRI data (GT) are given in Figs. 1 and 2. Regarding sampling diversity, the mean multi-scale structural similarity metric (MS-SSIM) 18 between the n = 10 synthetic MIPs and the corresponding real MIP per case in the validation dataset was 0.533 (± 0.09). FID was 0.215, computed with all n = 1020 synthetically generated images (10 per segmentation mask) and the corresponding n = 102 real MIP images from the validation dataset. In comparison, FID among real images was < 0.001.
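For orientation, the Fréchet Inception Distance reported above is commonly defined as the Fréchet distance between two Gaussians fitted to network features of the real and the synthetic images. The formula below is the standard textbook form and is not taken from this study; the symbols (μ_r, Σ_r for the feature mean and covariance of the real images, μ_s, Σ_s for the synthetic images) are notation assumed here for illustration.

```latex
\mathrm{FID} \;=\; \lVert \mu_r - \mu_s \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2\,\bigl(\Sigma_r \Sigma_s\bigr)^{1/2} \right)
```

Under this definition, a value of 0 corresponds to identical feature statistics, so lower values indicate that the synthetic feature distribution is closer to the real one.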
Fleiss' Kappa, computed as a measure of interrater agreement for the derived binarized outcome presence of any lesions (BI-RADS ≥ 2), was Kappa = 0.13 (p = 0.024) for real MIPs and Kappa = 0.23 (p < 0.001) for synthetic images (corresponding to 'slight' and 'fair' agreements). The area under the receiver operating characteristic (ROC) curve (AUC) for the detection of any lesions in real MIPs was 0.68, whereas in synthetic MIPs the AUC was 0.65 (Fig. 3A; for the results of the individual raters, please refer to supplement S7). DeLong's test showed no significant differences regarding the detection of any lesions in real and synthetic images (p = 0.635). Columns 2-4 of Table 3 show the corresponding contingency table.

Table 1 rows (patients with multiple examinations):
Two examinations: 281
Three examinations: 108
Four examinations: 61
Five examinations: 41
Six examinations: 3
Eight examinations: 1

For the presence of potentially significant lesions (BI-RADS ≥ 3), the interrater agreement was Kappa = 0.5 (p < 0.001) for real MIPs and Kappa = 0.67 (p < 0.001) for synthetic images (corresponding to 'moderate' and 'substantial' agreements). The AUC for the detection of potentially significant lesions in real MIPs was 0.79, whereas in synthetic MIPs the AUC was 0.86 (Fig. 3B; for the results of the individual raters, please refer to supplement S7). DeLong's test showed no significant differences regarding the detection of potentially significant lesions in real and synthetic images (p = 0.205). Columns 5-7 of Table 3 show the corresponding contingency table.
With respect to the derived binarized outcome suspicious lesions (BI-RADS ≥ 4), the interrater agreement was Kappa = 0.55 (p < 0.001) for real MIPs and Kappa = 0.74 (p < 0.001) for synthetic images (corresponding to 'moderate' and 'substantial' agreements). The AUC for detecting suspicious lesions in real MIPs was 0.77, whereas in synthetic MIPs the AUC was 0.88 (Fig. 3C; for the results of the individual raters, please refer to supplement S7), a difference that was not significant according to DeLong's test (p = 0.051). Columns 8-10 of Table 3 show the corresponding contingency table. More information on the interrater agreements between individual raters as well as the conditioning evaluation on a per-rater level is given in supplements S6 and S7.
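The agreement values reported above can be illustrated with a minimal sketch of Fleiss' Kappa and of the Landis and Koch interpretation bands. This is not the authors' implementation; the function names `fleiss_kappa` and `landis_koch` and the toy ratings are assumptions made here for illustration, and the sketch does not compute the p-values reported in the text.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for ratings[i] = list of category labels, one per rater, for image i.

    Hypothetical helper for illustration; assumes every image was rated by
    the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every image needs the same number of ratings")
    categories = sorted({label for row in ratings for label in row})
    counts = [Counter(row) for row in ratings]

    # Observed agreement per image: share of rater pairs that agree.
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall category proportions.
    p_j = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


def landis_koch(kappa):
    """Map an agreement coefficient to the Landis and Koch verbal categories."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Toy data (invented): five raters, binary label "lesion present" per image.
    toy = [[1, 1, 1, 0, 1], [0, 0, 0, 0, 0], [1, 0, 1, 1, 0], [0, 0, 1, 0, 0]]
    k = fleiss_kappa(toy)
    print(round(k, 3), landis_koch(k))
```

Applied to the reported coefficients, `landis_koch` reproduces the labels used in the text, for example 'slight' for 0.13 and 'substantial' for 0.67.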

Reading task 2: detection of synthetic MIPs
The interrater agreement in the detection of synthetic MIPs was Kappa = −0.009 (p = 0.682). The contingency table of the combined interrater label and the ground truth (GT) is shown in Table 4. The false negative rate in detecting synthetic MIPs was 66% (67/102) with a specificity of 76% (78/102). The AUC for the detection of synthetic MIPs was 0.58 (Fig. 3D; for the results of the individual raters, please refer to supplement S8). Both the low interrater agreement ('poor') and the ROC curve indicate that the detection of synthetic MIPs is akin to random guessing. More details regarding the interrater agreements between individual raters are given in supplement S6.
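The rates quoted above follow directly from the counts in the contingency table. The sketch below recomputes them under the assumption that the 204 MIPs split into 102 synthetic and 102 real images, so that the detected synthetic MIPs (102 − 67 = 35) and the real MIPs labeled as synthetic (102 − 78 = 24) are derived here rather than quoted from Table 4. The function name `detection_rates` is hypothetical.

```python
def detection_rates(tp, fn, tn, fp):
    """Return basic rates for the task 'is this MIP synthetic?' (synthetic = positive)."""
    return {
        "sensitivity": tp / (tp + fn),          # synthetic MIPs recognized as synthetic
        "false_negative_rate": fn / (tp + fn),  # synthetic MIPs taken for real
        "specificity": tn / (tn + fp),          # real MIPs recognized as real
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts: 67 missed synthetic MIPs and 78 correctly identified real MIPs are
# reported in the text; 35 and 24 are derived assuming 102 images per class.
rates = detection_rates(tp=35, fn=67, tn=78, fp=24)

if __name__ == "__main__":
    for name, value in rates.items():
        print(f"{name}: {value:.1%}")
```

The derived sensitivity and overall accuracy are illustrative by-products of the same counts and are not reported in the study.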

Reading task 3: anatomical correctness
The interrater reliability in the scoring of anatomical correctness was W = 0.33 (p < 0.001) for real MIPs and W = 0.24 (p = 0.084) for synthetic images (both corresponding to 'fair' agreements).
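Kendall's W, used here for the Likert-scaled scores, can be sketched as follows. Because Likert ratings contain many tied values, the sketch uses average ranks and the usual tie correction; whether the study's software applied the same correction is not stated in this section, so this is an assumption. The function names and the example scores are hypothetical.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # 1-based average rank of the tie group
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's coefficient of concordance; ratings[j][i] = score of rater j for image i."""
    m = len(ratings)      # number of raters
    n = len(ratings[0])   # number of images
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)

    # Tie correction: sum of (t^3 - t) over every tie group of every rater.
    ties = sum(t ** 3 - t for row in ratings for t in Counter(row).values())
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater gives a constant score")
    return 12 * s / denom


if __name__ == "__main__":
    # Invented Likert scores (1-5) of three raters for six images.
    likert = [[5, 4, 2, 3, 1, 4], [4, 4, 1, 3, 2, 5], [5, 3, 2, 2, 1, 4]]
    print(round(kendalls_w(likert), 3))
```

W ranges from 0 (no concordance among the raters' rankings) to 1 (identical rankings).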

Discussion
This study demonstrates an LDM that generates synthetic CE subtraction MIPs. The LDM was trained with n = 2832 CE-MIPs of the breast of n = 1966 patients. The conditioning process of the LDM was performed with n = 756 segmented lesions that indicated the underlying BI-RADS class and location, thus implicitly providing information on morphometric characteristics of these lesions. With an AUC of 0.58, the images generated by the LDM were not distinguishable from actual MRI-acquired data by five independent raters. The low MS-SSIM value found in our evaluation might be an indicator supporting this assessment, suggesting that the LDM generates synthetic images with a high diversity when sampling multiple images from the same segmentation mask used for the conditioning.
www.nature.com/scientificreports/
According to our multi-rater study, the detection of synthetic MIPs was akin to random guessing. Nevertheless, we found that there may also be a certain training effect in recognizing synthetically generated medical images: R1, R2, R3 and R5 had, according to their own statements, no previous experience with synthetically generated subtraction MIPs of the breast, whereas R4, as the medical supervisor of the experiments for this study, had already evaluated and reviewed LDM-generated MIPs of the breast before. This previous experience is also evident in the AUC of 0.64 that was achieved by R4 in detecting synthetic MIPs (see supplement S8). Furthermore, no differences could be observed in the detection of any lesions (BI-RADS ≥ 2). These observations suggest that the amount of training data was sufficient to condition the LDM with lesions. The multi-rater study further showed that the detection of both potentially significant lesions and especially suspicious lesions tended to be better in the synthetic data, however, without reaching statistical significance. This holds true for both the combined interrater labels and on an individual rater level, indicating that the LDM may have learned rather typical or 'textbook representations' of (suspicious) lesions, whereas cases in the acquired MRI data apparently could not be assigned as consistently to the underlying class in our multi-rater study.
We hypothesize that this finding may be related to the manner in which the LDM was conditioned with the segmentations and that larger amounts of training data would probably not remedy this. We assume that there is some heterogeneity in the visual appearance of lesions of a specific BI-RADS class on CE-MIPs with potential overlaps between lesions of different classes (see schema in supplement S10). However, the GT of the lesions depicted on the MIPs was established using the clinical reports, which were based on the full diagnostic multiparametric protocol that contained much more information than is visible in the MIPs. As this additional information was naturally lacking during the LDM training, the NN might have inferred general patterns between heterogeneous-appearing lesions of the same class. These patterns may be reflected insofar as the NN, when generating synthetic data, could tend to generate lesions that can be assigned more clearly to a particular class. Thus, the LDM might have learned to represent lesions in some ranges with a higher confidence, as reflected by the non-overlapping regions in the distribution curves of lesion appearance from the schema in supplement S10. This might explain the observed differences in the detection of (suspicious) lesions between synthetic and actual MRI-derived breast CE-MIPs. We consider this finding relevant as it indicates the requirement for future research to, first, investigate whether this potential limitation can be overcome by more sophisticated conditionings, and second, to further elucidate the effects of using synthetically generated 'textbook-like' data in potential areas of application such as the augmentation of training data for medical imaging DL tasks.
Table 3. Confusion matrix of the combined interrater label vs. ground truth, stratified by real and synthetic images, for any lesions (BI-RADS ≥ 2, columns 2-4), potentially significant lesions (BI-RADS ≥ 3, columns 5-7), and suspicious lesions (BI-RADS ≥ 4, columns 8-10). GT ground truth (defined as described in the "Methods"), BI-RADS breast imaging reporting and data system, MIP maximum intensity projection.
This feasibility study has several limitations. First, although not statistically significant, according to the multi-rater study the detection of lesions tended to be better in the synthetic data, pointing towards a limited capability of the trained LDM to generate a dataset that mimics the properties of lesions contained in an actual clinical breast CE-MIP dataset. Future research is required to investigate how the training of LDMs could be improved to better reflect the full spectrum of real-world lesions, emphasizing the necessity to represent the diversity of indiscriminate lesions and overcoming the limitation of benefiting from 'textbook representations' during the training process. For example, the conditioning of the lesions could be extended by confidence measures, e.g., reflecting the degree of agreement between multiple raters, or the defined classes could be divided into finer segments with edge cases explicitly annotated as such. Furthermore, LDMs could be conditioned with more parameters, including a greater variety of clinical findings and anatomical heterogeneity, different grades of image quality, breast density, background parenchymal enhancement, scanner-related features, and the full multiparametric spectrum of breast MRI sequences, to enable a more detailed property adjustment when generating synthetic breast MRI datasets.
Second, the fact that our reading study was also performed by two inexperienced raters (R1 and R3) could be used as an argument to question the validity of the results of this study, especially with regard to reading task 1 to categorize breast lesions, which was performed by a medical research assistant, a breast MRI-experienced resident, and additionally by one board-certified radiologist (whereas reading tasks 2-4 were performed by three board-certified radiologists alongside the two inexperienced raters). Especially regarding the categorization of breast lesions (reading task 1), we believe that the expressiveness of our results might even benefit from the reading by the more inexperienced raters. While inexperienced raters could certainly have difficulty distinguishing between edge cases, especially when reading MIPs as the only source of information, their performance could also be used as a proxy to make certain assumptions about the representation of findings in the images. So, if hypothetically an inexperienced rater were able to distinguish better between different classes on a synthetically generated image than on a real image, this could suggest that the process to generate the synthetic images has some properties that result in a representation of classes that allows even the inexperienced rater to distinguish them. For the lesion categorization (reading task 1), the substantial and almost perfect individual interrater agreements between both of the inexperienced raters and the board-certified radiologist (see supplement S6) allowed us to further assess the images labeled in this way to gain a deeper understanding of the LDM's generative capabilities. With regard to the individual readings of the lesion categorization, no significant differences were observed in the detection of any lesions and potentially significant lesions between real and synthetic MIPs for all individual raters (see paragraphs 1-3 of supplement S7). Regarding the detection of potentially significant lesions, the ROC curves of the individual raters shown in supplemental Figure Supp. 3 further indicate that, although not significantly different, regardless of the raters' experience, the AUC was consistently higher for the detection in synthetic MIPs (blue curves) as compared to real MIPs (red curves). With respect to suspicious lesions, while no significant differences between real and synthetic images could be observed for R1 and R2, R3 detected these lesions significantly better on the synthetic MIPs according to DeLong's test for two ROC curves (AUC: 0.86 vs. 0.69, p = 0.001) (see paragraph 3 of supplement S7 and Figure Supp. 4). The trend of a better lesion characterization on synthetic MIPs, visible in the ROC curves computed from the combined interrater labels (Fig. 3), is similar to the trend observed in the individual readings (supplemental figures Supp. 2, Supp. 3, and Supp. 4). Furthermore, the trend does not seem to depend on the raters' experience, which further supports the validity of the methodological approach as well as the trustworthiness of these results. These results also show that when evaluating synthetic images, it seems to be important to review such images with regard to different aspects, i.e. by providing raters with tasks that focus on specific peculiarities of the images in order to be able to decipher their synthetic origin.
Third, the anatomical correctness and realistic image impression were scored significantly lower in the synthetic MIPs in our multi-rater study, indicating that the training of our autoencoder NN could have benefited from additional training data. However, our sample size is comparable to those reported, for example, by Khader et al. (1250 knee MRI exams, 998 brain MRI exams, 1844 breast MRI exams, and 1010 lung CTs) using LDMs to generate volumetric medical datasets 15. As a technical metric to describe how realistic the synthetic images are, we report the FID metric for our model on this dataset to serve as a potential benchmark for future developments. Future studies and evaluations could also include extended direct comparisons of LDMs with other image generation techniques such as GAN-based approaches in the context of MRI-derived CE breast MIPs, which, however, was beyond the scope of the present study. Another limitation is that our experiments focused on 2D images with a lower resolution than that used in the clinical setting. To create even more realistic datasets, future works should train LDMs to generate high-resolution 3D volumes.
Table 4. Confusion matrix of the reading vs. ground truth of task 2, in which the raters decided for each maximum intensity projection (MIP) whether it is a real MIP ('0') or a synthetic MIP ('1'). GT: ground truth, i.e. whether the image was a real MIP or synthetically generated by the latent diffusion model.

The application of LDMs for synthetic data generation in medical imaging is an emerging research area. For example, Pinaya et al. demonstrated the generation of brain MRI datasets conditioned by different anatomical parameters 13. In another study, Khader et al. applied LDMs to generate computed tomography and MRI sequences of various anatomical regions 15. The insufficient availability of annotated training data is often an important limitation to DL development in medical imaging 10. As mentioned, for example, by Pinaya et al. 13 and demonstrated by Khader et al. 15, an obvious application of LDMs is the augmentation of medical imaging training datasets. For example, Khader et al. observed an improved segmentation performance when pre-training a NN with synthetic data that was generated by an LDM 15. Here, in addition to providing ad hoc semantically enriched and large (in principle infinite) datasets, or augmenting certain rare cases to reduce bias in machine learning (ML), LDMs might also help to improve privacy-preserving approaches for ML algorithms. Thus, our results demonstrate the capability to create synthetic data fitted to a potential clinical high-throughput setting such as (supplemental) MRI in breast cancer screening, in which (a) ML might be of special relevance in the future and (b) large and representative datasets are important to reduce the potential bias of the algorithms. Nevertheless, as the synthetically generated datasets may contain restrictions that could potentially limit the generalizability of the NNs trained with them, our results suggest that such an application of LDMs should currently be considered carefully, especially when the LDM is conditioned with only a limited set of parameters.
In conclusion, our study is among the first to demonstrate an LDM that generates synthetic MRI-derived CE-MIPs of the breast conditioned by lesions. Our multi-rater study further showed that the detection of (suspicious) lesions tended to be better in the synthetic data compared to actual MRI acquisitions, potentially indicating that the LDM might have generated 'textbook representations' of lesions in breast CE-MIPs. Further research is necessary to elucidate this finding and to investigate potential implications when using conditioned LDMs in medical imaging.

Study sample
This retrospective analysis was approved by the ethics committee of the Friedrich-Alexander-University (FAU) Erlangen-Nürnberg, which waived the need for written informed consent. The authors declare that this research was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects. The study period was between October 2015 and June 2020. Within this period, female patients with a clinically indicated breast MRI performed with a full diagnostic protocol including CE sequences at the Institute of Radiology of the University Hospital Erlangen (UHE) were included in this study. The study sample partially overlaps with previously reported cohorts in which (a) the automated detection of MRI artifacts on breast CE-MIPs by applying DL methods 19 and (b) a DL-based image quality assessment in high b-value diffusion-weighted breast MRI 20 were evaluated, as well as (c) an investigation of the prevalence of MRI artifacts in breast CE-MIPs 21. Details on the MRI protocols are given in supplement S1.
analogue classifications, Likert-scales), and Fleiss' Kappa 31,32 to assess the interrater agreement of the binary outcomes. Interrater agreements are interpreted according to Landis and Koch 33. To investigate the LDM's generation capabilities regarding different aspects of the learned conditioning, binary labels were computed from each rater's lesion assessment (reading task 1, lesion assessment according to the BI-RADS classification) in order to label the presence of any lesions (BI-RADS ≥ 2), the presence of potentially significant lesions (BI-RADS ≥ 3), and the presence of suspicious lesions (BI-RADS ≥ 4). For each image, these computed binary labels as well as the binary label from reading task 2 (detection of synthetic MIPs) were each aggregated into a final interrater label by calculating the arithmetic mean across the five raters in order to reflect the degree of agreement or confidence between the raters. Likewise, the Likert-scale based labels from reading tasks 3 and 4 (anatomical correctness and realistic image impression) were also aggregated into final combined labels by calculating the arithmetic mean across the raters for each image. To be able to analyze the binary interrater labels with contingency tables, images with an average interrater score > 0.5 were considered to belong to the class (which corresponds to an aggregation according to the best-of-n method in the case of an uneven number of raters). Differences in the Likert-scaled ratings between real and synthetic MIPs were assessed with Wilcoxon's rank sum test 34,35. The lesion conditioning capabilities of the LDM were assessed with ROC curves, computed with the derived binary interrater labels and the corresponding ground truth (GT). Differences in the AUC between real and synthetic MIPs were assessed with DeLong's test 36. The multi-scale structural similarity metric (MS-SSIM) 18 was computed using the implementation from the torchmetrics Python package, version 0.11.1 37. MS-SSIM, with possible values between 0 and 1, was used in our study to evaluate the generation diversity, with lower MS-SSIM values indicating a higher diversity (suggestive of the LDM "inventing" new images) and higher values indicating the generation of more similar synthetic images. The MS-SSIM values were computed for each case in the validation dataset by forming 10 pairs, each consisting of the real MIP and one of the 10 sampled synthetic images, in order to assess the intra-case generation diversity. The reported value is the arithmetic mean across all MS-SSIM values computed in this way from the cases of the validation dataset. Additionally, the Fréchet Inception Distance (FID) 38, using the implementation from the torchmetrics Python package, version 0.11.1 37, was employed to assess whether the generated synthetic images stem from a similar distribution as the original images. FID was computed using Inception v3 feature layer 64 and the original weights from 38. The significance level was set to α = 0.05 for all statistical tests. No correction for multiplicity was performed. More details on the statistical analysis are given in supplement S4.
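The label aggregation described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' code: the per-image mean across raters serves as a graded score, the > 0.5 threshold yields the majority label for the contingency tables, and the AUC is computed from the graded score with the rank-based (Mann-Whitney) formulation. Using the mean score as the ROC input is an assumption made here, and the function names and toy ratings are hypothetical.

```python
def aggregate_binary_labels(rater_labels):
    """rater_labels[j][i] is the 0/1 label of rater j for image i.

    Returns (mean_scores, majority_labels), one entry per image.
    """
    n_raters = len(rater_labels)
    n_images = len(rater_labels[0])
    mean_scores = [
        sum(rater[i] for rater in rater_labels) / n_raters for i in range(n_images)
    ]
    # With an odd number of raters a mean > 0.5 is a strict majority vote.
    majority_labels = [1 if score > 0.5 else 0 for score in mean_scores]
    return mean_scores, majority_labels


def auc_from_scores(scores, truth):
    """AUC as the probability that a positive case scores above a negative one (ties count 0.5)."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Invented ratings: five raters, four images, label 1 = "suspicious lesion present".
    ratings = [
        [1, 0, 1, 0],
        [1, 0, 1, 1],
        [1, 0, 0, 0],
        [0, 0, 1, 0],
        [1, 1, 1, 0],
    ]
    ground_truth = [1, 0, 1, 0]
    scores, labels = aggregate_binary_labels(ratings)
    print(scores, labels, auc_from_scores(scores, ground_truth))
```

Keeping the graded mean alongside the thresholded label preserves the degree of agreement between raters, which is what allows a ROC curve to be drawn from otherwise binary readings.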

Figure 1. Example images (A) for six cases (BI-RADS 1-6). Row 1 (GT ground truth) shows the acquired breast MRI data with the contrast-enhanced maximum intensity projection (MIP) depicted. Row 2 shows the segmentation mask of the lesion from the GT, which was used for conditioning the latent diffusion model. Rows 3-12 show generated synthetic example images (S1-S10). For each BI-RADS class, one example image is given in the figure (columns). GT ground truth, BI-RADS breast imaging reporting and data system.

Figure 2. Example images (B) for six cases (BI-RADS 1-6). Row 1 (GT ground truth) shows the acquired breast MRI data with the contrast-enhanced maximum intensity projection (MIP) depicted. Row 2 shows the segmentation mask of the lesion from the GT, which was used for conditioning the latent diffusion model. Rows 3-12 show generated synthetic example images (S1-S10). For each BI-RADS class, one example image is given in the figure (columns). GT ground truth, BI-RADS breast imaging reporting and data system.

Table 1. Cohort used for training the auto-encoder.

Table 2. Cohort used for training and validating the latent diffusion model. Percentages of the histopathology results are given in relation to all available cases that were classified with the respective BI-RADS score, including those for which the pathology reports were not available. No. number of, N/A not available histopathology results.